First things first... Let's read the CSV files and take a look at what's inside them
# reading the CSVs
import pandas as pd
daily_device_counts = pd.read_csv('../data/daily_device_counts.csv.gz', compression='gzip')
fleet_metadata = pd.read_csv('../data/fleet_metadata.csv.gz', compression='gzip')
# checking daily_device_counts data
daily_device_counts.head()
| | uuid | date | devices |
|---|---|---|---|
| 0 | 8f14e45fceea167a5a36dedd4bea2543 | 2016-05-26 | 1 |
| 1 | 8f14e45fceea167a5a36dedd4bea2543 | 2016-05-27 | 1 |
| 2 | 8f14e45fceea167a5a36dedd4bea2543 | 2016-05-28 | 1 |
| 3 | 8f14e45fceea167a5a36dedd4bea2543 | 2016-05-29 | 1 |
| 4 | 8f14e45fceea167a5a36dedd4bea2543 | 2016-05-30 | 1 |
No surprises here... As described, the data set has three columns: `uuid`, `date`, and `devices`.
For someone not very familiar with the dataset (like me at this point), making the column names self-explanatory helps later on during the analysis (especially when multiple data sets are involved). So I'll rename some of the variables so it's obvious what they represent.
# renaming column names
daily_device_counts = daily_device_counts.rename(
columns = {
'devices': 'devices_count',
'uuid': 'fleet_id'
}
)
# checking daily_device_counts dataframe
daily_device_counts.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80407 entries, 0 to 80406
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   fleet_id       80407 non-null  object
 1   date           80407 non-null  object
 2   devices_count  80407 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.8+ MB
Apparently no NULL values... Let's double-check and make sure there are no strings that actually represent NULLs.
# checking number of characters for all fleet_id
daily_device_counts['fleet_id'].str.len().unique()
array([32])
All IDs have 32 characters, so that's a good indication that the IDs are fine. Let's now try to convert `date` to a datetime and see if there are any issues with that.
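Beyond length, a stricter sanity check is to confirm each ID is a 32-character hex string and that no placeholder-null strings slipped through. A minimal sketch on made-up IDs (the placeholder list is my own assumption, not from the dataset):

```python
import pandas as pd

# hypothetical sample standing in for daily_device_counts
df = pd.DataFrame({'fleet_id': ['8f14e45fceea167a5a36dedd4bea2543',
                                '6512bd43d9caa6e02c990b0a82652dca']})

# every fleet_id should be exactly 32 lowercase hex characters
is_hex = df['fleet_id'].str.fullmatch(r'[0-9a-f]{32}')
assert is_hex.all()

# guard against strings that merely *look* like missing values (assumed list)
placeholders = {'', 'null', 'na', 'n/a', 'none'}
assert not df['fleet_id'].str.strip().str.lower().isin(placeholders).any()
```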
# converting 'date' column to datetime data type
daily_device_counts['date'] = pd.to_datetime(daily_device_counts['date'])
daily_device_counts.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80407 entries, 0 to 80406
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   fleet_id       80407 non-null  object        
 1   date           80407 non-null  datetime64[ns]
 2   devices_count  80407 non-null  int64         
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 1.8+ MB
Perfect... We converted `date` to a datetime type and still no NULLs. Also, let's make sure each combination of `fleet_id` + `date` appears only once. As you can see, no problems there:
# show duplicates
daily_device_counts[daily_device_counts[['fleet_id','date']].duplicated()]
| fleet_id | date | devices_count |
|---|---|---|
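A side note on the date conversion above: it raised no errors, but `pd.to_datetime` with `errors='coerce'` is a more defensive variant that turns unparseable values into `NaT` so they can be counted, instead of raising mid-conversion. A small sketch on made-up values:

```python
import pandas as pd

dates = pd.Series(['2016-05-26', '2016-05-27', 'not-a-date'])
parsed = pd.to_datetime(dates, errors='coerce')  # bad values become NaT
n_bad = parsed.isna().sum()
print(n_bad)  # number of unparseable dates
```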
# checking fleet_metadata data
fleet_metadata.head()
| | uuid | device_type | industry |
|---|---|---|---|
| 0 | 8f14e45fceea167a5a36dedd4bea2543 | Beaglebone | Agriculture |
| 1 | 6512bd43d9caa6e02c990b0a82652dca | Raspberry Pi | Consumer & Home Automation |
| 2 | c74d97b01eae257e44aa9d5bade97baf | Raspberry Pi | Transportation |
| 3 | 70efdf2ec9b086079795c442636b55fb | Raspberry Pi | Building Automation |
| 4 | 37693cfc748049e45d87b8c7d8b9aacd | Raspberry Pi | Monitoring |
Again, no surprises here. As described, the data set has three columns: `uuid`, `device_type`, and `industry`.
I'll rename `uuid` to `fleet_id` for consistency and check for NULLs.
# renaming column names
fleet_metadata = fleet_metadata.rename(
columns = {
'uuid': 'fleet_id'
}
)
# checking fleet_metadata dataframe
fleet_metadata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95 entries, 0 to 94
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   fleet_id     95 non-null     object
 1   device_type  95 non-null     object
 2   industry     95 non-null     object
dtypes: object(3)
memory usage: 2.4+ KB
# checking number of characters for all fleet_id
fleet_metadata['fleet_id'].str.len().unique()
array([32])
All good with the fleet_id. Let's take a look at device_type and industry:
# counting the combination of device_type and industries
fleet_metadata.groupby(['device_type','industry']).size()
device_type industry
Beaglebone Agriculture 2
Consumer & Home Automation 1
Manufacturing & Logistics 2
Retail & Signage 1
Intel NUC Construction 1
Manufacturing & Logistics 3
Retail & Signage 2
Transportation 2
Raspberry Pi Agriculture 4
Apparel 1
Building Automation 12
Clothing 1
Consumer & Home Automation 13
Education 1
Energy 4
Energy Generation 1
Industrial 1
Manufacturing & Logistics 11
Medical 3
Monitoring 1
Retail & Signage 21
Robotics & Drones 2
Smart Home 1
Telecommunications 1
Telecoms 1
Transportation 2
dtype: int64
Perfect! Besides validating the data set for NULLs, we can already notice that most fleets use Raspberry Pi, and that some industries could perhaps be merged together (e.g. Telecommunications and Telecoms).
I'll quickly make that adjustment, but ideally this kind of data inaccuracy would be handled at the source if possible. So in a real-life scenario, I'd investigate how the data was generated and evaluate whether any process/system improvement could be made.
# merging together some industries
fleet_metadata = fleet_metadata.replace({
'Energy Generation': 'Energy',
'Telecoms': 'Telecommunications',
'Smart Home': 'Consumer & Home Automation'
})
fleet_metadata.groupby(['industry']).size()
industry
Agriculture                    6
Apparel                        1
Building Automation           12
Clothing                       1
Construction                   1
Consumer & Home Automation    15
Education                      1
Energy                         5
Industrial                     1
Manufacturing & Logistics     16
Medical                        3
Monitoring                     1
Retail & Signage              24
Robotics & Drones              2
Telecommunications             2
Transportation                 4
dtype: int64
One final check is to make sure each fleet_id appears only once. And again, no issues as you can see:
# show duplicates
fleet_metadata[fleet_metadata['fleet_id'].duplicated()]
| fleet_id | device_type | industry |
|---|---|---|
Since we have no duplicates and each fleet_id is unique in fleet_metadata, I'll merge both data sets:
# merge both data sets and rename fields
daily_device_counts = pd.merge(daily_device_counts, fleet_metadata, how='left', validate='many_to_one')
daily_device_counts = daily_device_counts.rename(
columns = {
'device_type': 'devices_type',
'industry': 'fleet_industry'
}
)
daily_device_counts.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 80407 entries, 0 to 80406
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   fleet_id        80407 non-null  object        
 1   date            80407 non-null  datetime64[ns]
 2   devices_count   80407 non-null  int64         
 3   devices_type    80407 non-null  object        
 4   fleet_industry  80407 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 3.7+ MB
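The `info()` output shows no new NULLs after the left join, which already implies every `fleet_id` found a match. An explicit way to confirm match coverage is the merge's `indicator` flag; a sketch with toy frames (names only mirror the real ones):

```python
import pandas as pd

counts = pd.DataFrame({'fleet_id': ['a', 'a', 'b'], 'devices_count': [1, 2, 3]})
meta = pd.DataFrame({'fleet_id': ['a', 'b'], 'industry': ['Energy', 'Medical']})

# '_merge' marks each row as 'both', 'left_only', or 'right_only'
merged = pd.merge(counts, meta, how='left', validate='many_to_one', indicator=True)
assert (merged['_merge'] == 'both').all()
```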
The first question I have on my mind is regarding the size of the fleets and how that's evolving over time.
import plotly.express as px
import plotly.offline as pyo
# Set notebook mode to work in offline
pyo.init_notebook_mode()
# plotting the graph
fig = px.line(daily_device_counts, x='date', y='devices_count', color='fleet_id', title='Number of devices per fleet')
fig.update_layout(showlegend=False)
pyo.iplot(fig, filename = 'number_of_devices_per_fleet_over_time')
Clearly, there are some fleets that are outliers and quite big compared to the other ones. Let's investigate that a bit more.
# identifying fleets that reached high devices_count
daily_device_counts.groupby('fleet_id').devices_count.max().sort_values(ascending=False)
fleet_id
f0f717dbbabbabab1a26beaf06fc3438 15747
7314a789d9c965494dc1b1c371d120e9 3949
736d0469753aa6d375ef9679ef3465fb 2754
9f61408e3afb633e50cdf1b20de6f466 1806
6364d3f0f495b6ab9dcf8d3b5c6e0b01 1663
...
218344619d8fb95d504ccfa11804073f 2
6a7c8b3c6b0ecb1b72233e98c09793f4 2
68d30a9594728bc39aa24be94b319d21 1
cf9b2d0406020c56599f9a93708832b5 1
f899139df5e1059396431415e770c6dd 1
Name: devices_count, Length: 95, dtype: int64
# identifying some outliers
ids_to_remove = daily_device_counts.groupby('fleet_id') \
.devices_count.max() \
.sort_values(ascending=False) \
.head(5) \
.index \
.to_list()
daily_device_counts_filtered = daily_device_counts[~daily_device_counts['fleet_id'].isin(ids_to_remove)]
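Picking the top 5 by peak count works here, but it hard-codes the number of outliers. A more general (assumed) alternative is a quantile cutoff on each fleet's peak; the 0.8 threshold below is illustrative, not derived from the data:

```python
import pandas as pd

# toy data: fleets a-d are small, fleet e is an outlier
df = pd.DataFrame({'fleet_id': list('aabbccddee'),
                   'devices_count': [1, 2, 3, 4, 5, 6, 7, 8, 1000, 2000]})
peaks = df.groupby('fleet_id')['devices_count'].max()
cutoff = peaks.quantile(0.8)  # assumed threshold; tune per dataset
outliers = peaks[peaks > cutoff].index.tolist()
print(outliers)
```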
# plotting the graph (removing some outliers)
fig = px.line(daily_device_counts_filtered, x='date', y='devices_count', color='fleet_id', title='Number of devices per fleet (outliers excluded)')
fig.update_layout(showlegend=False)
pyo.iplot(fig, filename = 'number_of_devices_per_fleet_filtered')
Now that we removed some outliers, the remaining fleets are much easier to compare.
Next, I isolated the outliers; again, investigating the reason for these spikes could generate valuable insights. I'd also be curious to identify the fleet that has over 15k devices:
fleet_metadata[fleet_metadata['fleet_id'].isin(ids_to_remove)]
| | fleet_id | device_type | industry |
|---|---|---|---|
| 8 | 6364d3f0f495b6ab9dcf8d3b5c6e0b01 | Raspberry Pi | Building Automation |
| 19 | 9f61408e3afb633e50cdf1b20de6f466 | Raspberry Pi | Energy |
| 83 | 736d0469753aa6d375ef9679ef3465fb | Raspberry Pi | Robotics & Drones |
| 90 | f0f717dbbabbabab1a26beaf06fc3438 | Raspberry Pi | Building Automation |
| 92 | 7314a789d9c965494dc1b1c371d120e9 | Raspberry Pi | Retail & Signage |
# selecting the outlier fleets
daily_device_counts_outliers = daily_device_counts[daily_device_counts['fleet_id'].isin(ids_to_remove)]
# plotting the graph (outliers only)
fig = px.line(daily_device_counts_outliers, x='date', y='devices_count', color='fleet_id', title='Number of devices per fleet (outliers only)')
fig.update_layout(showlegend=False)
pyo.iplot(fig, filename = 'number_of_devices_per_fleet_outliers')
It's already obvious how skewed the distribution of the number of devices is, and there are two things I still want to take a look at:
# plotting the total devices_count over time
total_daily_device_counts = daily_device_counts.groupby('date').devices_count.sum().to_frame().reset_index()
fig = px.line(total_daily_device_counts, x='date', y='devices_count')
pyo.iplot(fig, filename = 'number_of_devices_over_time')
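If the daily totals turn out noisy, resampling to weekly sums is one way to smooth the series before plotting; a small sketch on synthetic data:

```python
import pandas as pd

# two weeks of synthetic daily totals
idx = pd.date_range('2019-01-01', periods=14, freq='D')
daily = pd.Series(range(14), index=idx)

# 'W' bins end on Sundays by default
weekly = daily.resample('W').sum()
print(weekly)
```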
# filtering only last day
device_counts_last_day = daily_device_counts[daily_device_counts['date'] == daily_device_counts['date'].max()].reset_index()
device_counts_last_day
# looking at its distribution
device_counts_last_day['devices_count'].describe()
count       91.000000
mean       378.857143
std       1700.037232
min          1.000000
25%         13.500000
50%         57.000000
75%        189.000000
max      15747.000000
Name: devices_count, dtype: float64
# grouping data per devices_count quantiles
device_counts_last_day['devices_count_group'] = pd.qcut(device_counts_last_day['devices_count'],q=10, precision=0)
device_counts_last_day_agg = device_counts_last_day.groupby(by=['date','devices_count_group']).agg(
devices_count=pd.NamedAgg(column='devices_count', aggfunc='sum'),
fleets_count=pd.NamedAgg(column='devices_count', aggfunc='size')
)
device_counts_last_day_agg
| date | devices_count_group | devices_count | fleets_count |
|---|---|---|---|
| 2019-12-31 | (0.0, 4.0] | 22 | 10 |
| | (4.0, 10.0] | 91 | 11 |
| | (10.0, 19.0] | 104 | 7 |
| | (19.0, 35.0] | 236 | 9 |
| | (35.0, 57.0] | 433 | 9 |
| | (57.0, 103.0] | 712 | 9 |
| | (103.0, 146.0] | 1097 | 9 |
| | (146.0, 254.0] | 1788 | 9 |
| | (254.0, 636.0] | 3592 | 9 |
| | (636.0, 15747.0] | 26401 | 9 |
device_counts_last_day_agg['percentage_of_total_devices'] = device_counts_last_day_agg['devices_count'] / device_counts_last_day_agg['devices_count'].sum()
device_counts_last_day_agg
| date | devices_count_group | devices_count | fleets_count | percentage_of_total_devices |
|---|---|---|---|---|
| 2019-12-31 | (0.0, 4.0] | 22 | 10 | 0.000638 |
| | (4.0, 10.0] | 91 | 11 | 0.002640 |
| | (10.0, 19.0] | 104 | 7 | 0.003017 |
| | (19.0, 35.0] | 236 | 9 | 0.006845 |
| | (35.0, 57.0] | 433 | 9 | 0.012559 |
| | (57.0, 103.0] | 712 | 9 | 0.020652 |
| | (103.0, 146.0] | 1097 | 9 | 0.031819 |
| | (146.0, 254.0] | 1788 | 9 | 0.051862 |
| | (254.0, 636.0] | 3592 | 9 | 0.104188 |
| | (636.0, 15747.0] | 26401 | 9 | 0.765779 |
Notice how the Pareto principle applies here. Looking at the most recent date, roughly 77% of all devices come from only 10% of the fleets (9 fleets). Those deserve special attention, since they probably bring in most of the company's revenue and are the main users of the system.
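The same observation can be computed directly as a cumulative share over fleets sorted from largest to smallest; a sketch with an illustrative (not real) distribution of per-fleet counts:

```python
import pandas as pd

# toy per-fleet device counts, deliberately heavy-tailed
counts = pd.Series([15747, 3949, 100, 50, 20, 10, 5, 2, 1, 1])

# cumulative share of devices, largest fleets first
shares = counts.sort_values(ascending=False).cumsum() / counts.sum()
top1_share = shares.iloc[0]  # fraction covered by the single largest fleet
print(round(top1_share, 2))  # → 0.79
```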
Based on the information we have, let's try to forecast the number of devices we can expect to observe in the future. I'll plot the total number of devices over time again; notice how it looks like we could fit a curve using a polynomial regression:
fig = px.line(total_daily_device_counts, x='date', y='devices_count')
pyo.iplot(fig, filename = 'number_of_devices_over_time')
import datetime as dt
x = total_daily_device_counts['date'].map(dt.datetime.toordinal).values.astype(int).reshape(-1,1)
y = total_daily_device_counts['devices_count'].values.reshape(-1,1)
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly = PolynomialFeatures(degree = 3)
X_poly = poly.fit_transform(x)
reg = LinearRegression()
reg.fit(X_poly, y)
poly_reg = reg.predict(X_poly)
total_daily_device_counts['polynomial_regression_forecast'] = poly_reg
fig = px.line(total_daily_device_counts.melt(id_vars='date'), x='date', y='value', color='variable')
pyo.iplot(fig, filename = 'y_forecast')
Not great, but it does the job... The curve doesn't fit very well for values before 2018 (it even goes negative), and the error between the real value and the model's forecast for the most recent date doesn't make me very happy. I quickly tried fitting a polynomial regression (also degree 3) using Google Sheets and it gave me a better model.
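One plausible reason the fit underperforms is numerical conditioning: raw date ordinals around 736,000, cubed, produce huge and nearly collinear features. Centering and scaling x before expanding the polynomial usually helps; a hedged sketch on synthetic data (not the real series):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.arange(736000, 736100).reshape(-1, 1).astype(float)  # ordinal-like dates
y = 0.001 * (x.ravel() - 736000) ** 3 + rng.normal(0, 5, size=100)  # noisy cubic

# scale first so the cubic features stay well-conditioned
model = make_pipeline(StandardScaler(), PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
r2 = model.score(x, y)
print(round(r2, 3))
```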

If I had to spend more time working on this task: